RentSafeTO: Predicting Building Evaluation with Machine Learning¶


Author: Hyunjung Kim
Date: September 5, 2023

Table of Contents in this notebook

  1. Introduction
  2. EDA
  3. Modeling
  4. Conclusion

Introduction:¶

Welcome to our RentSafeTO project! In this capstone, we aim to predict the safety assessment of apartment buildings in Toronto using machine learning. Our goal is to provide valuable insights to property owners, empowering them to make informed decisions about housing maintain.

The Big Idea:

RentSafeTO revolves around analyzing factors that influence building safety. By considering variables such as building height, construction year, population density, laundry facilities, and waste disposal, our predictive model aims to forecast the safety evaluation outcomes.

The Impact:

The impact of RentSafeTO extends to multiple stakeholders. Prospective property buyers can benefit from a risk assessment before purchasing an apartment building. Tenants planning to move can access safety information to choose safer living environments. Landlords can proactively address safety concerns, leading to a more secure rental market.

The Data:

Our project utilizes data from the Toronto open data site, encompassing various apartment building evaluations. By analyzing this dataset, we aim to uncover patterns and correlations, enabling our predictive model to make accurate safety assessments.

EDA¶

  1. Sanity Check
  2. Clean Data
  3. visualization
In [1]:
# Importing libraries
import pandas as pd
import numpy as np
import seaborn as sns

Sometimes a warning comes out because it's slightly different from version to version.

In [2]:
from warnings import filterwarnings

filterwarnings('ignore')
In [3]:
# Read Data
df = pd.read_csv('../data/Apartment Building Evaluation.csv')

Sanity Check¶

It's time to check our data.

In [4]:
# Sanity Check
df.head()
Out[4]:
_id RSN YEAR_REGISTERED YEAR_EVALUATED YEAR_BUILT PROPERTY_TYPE WARD WARDNAME SITE_ADDRESS CONFIRMED_STOREYS ... EXTERIOR_WALKWAYS BALCONY_GUARDS WATER_PEN_EXT_BLDG_ELEMENTS PARKING_AREA OTHER_FACILITIES GRID LATITUDE LONGITUDE X Y
0 4167486 4304347 NaN NaN 1999.0 PRIVATE 2 Etobicoke Centre ** CREATED IN ERROR ** 399 THE WEST MALL 22 ... 5.0 5.0 5.0 5.0 5.0 W0233 43.643781 -79.565456 299503.625 4833538.964
1 4167487 5157421 2023.0 NaN 1973.0 TCHC 17 Don Valley North 6 TREE SPARROWAY 4 ... 3.0 5.0 4.0 3.0 4.0 N1721 43.791384 -79.369630 315272.148 4849932.515
2 4167488 5156814 2023.0 NaN 1973.0 TCHC 17 Don Valley North 13 FIELD SPARROWAY 4 ... 4.0 5.0 4.0 3.0 4.0 N1721 43.790920 -79.368771 315334.815 4849906.373
3 4167489 5157387 2023.0 NaN 1973.0 TCHC 17 Don Valley North 4 TREE SPARROWAY 4 ... 3.0 5.0 4.0 3.0 4.0 N1721 43.791448 -79.369332 315291.755 4849938.162
4 4167490 5156871 2023.0 NaN 1973.0 TCHC 17 Don Valley North 2 TREE SPARROWAY 4 ... 5.0 5.0 4.0 3.0 4.0 N1721 43.791511 -79.369045 315330.308 4849947.465

5 rows × 40 columns

In [5]:
df.shape
Out[5]:
(11760, 40)

In this dataset, we have 11760 rows and 40 coulumns.

In [6]:
for i in df.columns:
    print
    print(i)
_id
RSN
YEAR_REGISTERED
YEAR_EVALUATED
YEAR_BUILT
PROPERTY_TYPE
WARD
WARDNAME
SITE_ADDRESS
CONFIRMED_STOREYS
CONFIRMED_UNITS
EVALUATION_COMPLETED_ON
SCORE
RESULTS_OF_SCORE
NO_OF_AREAS_EVALUATED
ENTRANCE_LOBBY
ENTRANCE_DOORS_WINDOWS
SECURITY
STAIRWELLS
LAUNDRY_ROOMS
INTERNAL_GUARDS_HANDRAILS
GARBAGE_CHUTE_ROOMS
GARBAGE_BIN_STORAGE_AREA
ELEVATORS
STORAGE_AREAS_LOCKERS
INTERIOR_WALL_CEILING_FLOOR
INTERIOR_LIGHTING_LEVELS
GRAFFITI
EXTERIOR_CLADDING
EXTERIOR_GROUNDS
EXTERIOR_WALKWAYS
BALCONY_GUARDS
WATER_PEN_EXT_BLDG_ELEMENTS
PARKING_AREA
OTHER_FACILITIES
GRID
LATITUDE
LONGITUDE
X
Y

We have 40 columns, let's describe and sort those.

Columns:

  1. Building Information:

    • _id: Unique row identifier for Open Data database
    • RSN: Building Identifier
    • YEAR REGISTERED: Year of Registration in RentSafeTO
    • YEAR BUILT: Year the Building was Constructed
    • YEAR EVALUATED: Year of Building Evaluation
    • PROPERTY TYPE: Type of Building Ownership (Private, Toronto Community Housing Corporation, Other Assisted or Social Housing Providers)
    • NUMBERING OF PROPERTY: Appropriate size and visibility of property numbering
  2. Location Information:

    • WARD: Ward where the Building is Located
    • WARDNAME: Name of the Ward
    • SITE ADDRESS: Building Address
    • GRID: Administrative Area the Building Belongs to
    • LATITUDE
    • LONGITUDE
    • X
    • Y
  3. Building Details:

    • CONFIRMED STOREYS: Number of Storeys in the Building
    • CONFIRMED UNITS: Number of Units (Dwellings) in the Building
    • EVALUATION COMPLETED ON: Date of Building Evaluation
    • NO OF AREAS EVALUATED: Number of items evaluated during a single evaluation
    • SCORE: Weighted Average Score of the Building
    • RESULTS_OF_SCORE : Result of the score
  4. Exterior Maintenance:

    • EXTERIOR GROUNDS: Maintenance, cleanliness, landscaping, drainage, and lighting of exterior grounds
    • FENCING: Maintenance and materials of all fencing within the property
    • RETAINING WALLS: Maintenance and safety of all retaining walls
    • CATCH BASINS / STORM DRAINAGE: Maintenance and condition of catch basins and storm drainage systems
    • BUILDING EXTERIOR: Maintenance and safety of exterior walls, flashing, pipes, attachments, and balcony slabs
    • BALCONY GUARDS: Maintenance and safety of balcony guards
    • WINDOWS: Maintenance of all windows, safety devices, and window screens
    • EXT. RECEPTACLE STORAGE AREA: Maintenance and cleanliness of exterior waste storage areas
    • EXTERIOR WALKWAYS: Maintenance, cleanliness, drainage, and safety of exterior walkways
    • CLOTHING DROP BOXES: Maintenance and safety of clothing drop boxes
    • ACCESSORY BUILDINGS: Maintenance and safety of additional buildings or structures
    • WATER_PEN_EXT_BLDG_ELEMENTS: measures water infiltration in building components.
  5. Interior Maintenance:

    • INTERCOM: Maintenance and operability of intercoms
    • EMERGENCY CONTACT SIGN: Maintenance of emergency contact signs
    • LOBBY - WALLS AND CEILING: Maintenance and safety of lobby walls and ceilings
    • LOBBY FLOORS: Maintenance and safety of lobby floors
    • LAUNDRY ROOM: Maintenance, operability, and lighting of laundry rooms
    • INT. RECEPTACLE STORAGE AREA: Maintenance and cleanliness of interior waste storage areas
    • MAIL RECEPTACLES: Maintenance and safety of mailboxes
    • EXTERIOR DOORS: Maintenance and operability of exterior doors
    • STORAGE AREAS/LOCKERS MAINT.: Maintenance of storage areas/lockers
    • ELEVATOR MAINTENANCE: Maintenance of elevators to keep them in good repair
    • ELEVATOR COSMETICS: Maintenance and condition of elevator parts and attachments
  6. Common Area Maintenance:

    • COMMON AREA VENTILATION: Maintenance of ventilation systems in common areas
    • ELECTRICAL SERVICES / OUTLETS: Maintenance of electrical fixtures, switches, receptacles, and connections
    • CHUTE ROOMS - MAINTENANCE: Maintenance and operability of chute rooms
    • STAIRWELL - WALLS AND CEILING: Maintenance and safety of stairwell walls and ceilings
    • STAIRWELL - LANDING AND STEPS: Maintenance and safety of stairwell landings and steps
    • STAIRWELL LIGHTING: Maintenance and safety of stairwell lighting
    • INT. HANDRAIL / GUARD - SAFETY: Safety of interior handrails and guards
    • INT. HANDRAIL / GUARD - MAINT.: Maintenance and cleanliness of interior handrails and guards
  7. Building Hygiene:

    • BUILDING CLEANLINESS: Keeping common areas clean according to standards
    • COMMON AREA PESTS: Handling pests in common areas
    • GRAFFITI: Handling of graffiti in common areas or exterior grounds
  8. Building Services:

    • TENANT NOTIFICATION BOARD: Displaying important notices for tenants
    • PEST CONTROL LOG: Keeping a log of pest inspections and treatments
    • MAINTENANCE LOG: Keeping a log of service and maintenance on various systems
    • CLEANING LOG: Keeping a log of scheduled or unscheduled cleaning activities
    • VITAL SERVICE PLAN: Maintaining a plan for essential services to tenants
    • ELECTRICAL SAFETY PLAN: Maintaining an electrical maintenance plan
    • STATE OF GOOD REPAIR PLAN: Maintaining a plan for repairs and improvements
    • TENANT SERVICE REQUEST LOG: Keeping a log of tenant service requests
  9. Others:

    • PARKING AREAS: Maintenance and cleanliness of parking areas
    • POOLS: Maintenance and access to pools, if applicable
    • OTHER AMENITIES: Maintenance and access to community rooms, play areas, gyms, or tennis courts
    • ABANDONED EQUIP./DERELICT VEH.: Handling abandoned equipment and derelict vehicles
In [7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11760 entries, 0 to 11759
Data columns (total 40 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   _id                          11760 non-null  int64  
 1   RSN                          11760 non-null  int64  
 2   YEAR_REGISTERED              11455 non-null  float64
 3   YEAR_EVALUATED               9751 non-null   float64
 4   YEAR_BUILT                   11714 non-null  float64
 5   PROPERTY_TYPE                11760 non-null  object 
 6   WARD                         11760 non-null  int64  
 7   WARDNAME                     11760 non-null  object 
 8   SITE_ADDRESS                 11760 non-null  object 
 9   CONFIRMED_STOREYS            11760 non-null  int64  
 10  CONFIRMED_UNITS              11760 non-null  int64  
 11  EVALUATION_COMPLETED_ON      11760 non-null  object 
 12  SCORE                        11760 non-null  int64  
 13  RESULTS_OF_SCORE             11760 non-null  object 
 14  NO_OF_AREAS_EVALUATED        11760 non-null  int64  
 15  ENTRANCE_LOBBY               11758 non-null  float64
 16  ENTRANCE_DOORS_WINDOWS       11759 non-null  float64
 17  SECURITY                     11754 non-null  float64
 18  STAIRWELLS                   11757 non-null  float64
 19  LAUNDRY_ROOMS                11104 non-null  float64
 20  INTERNAL_GUARDS_HANDRAILS    11757 non-null  float64
 21  GARBAGE_CHUTE_ROOMS          5102 non-null   float64
 22  GARBAGE_BIN_STORAGE_AREA     11749 non-null  float64
 23  ELEVATORS                    6897 non-null   float64
 24  STORAGE_AREAS_LOCKERS        4773 non-null   float64
 25  INTERIOR_WALL_CEILING_FLOOR  11758 non-null  float64
 26  INTERIOR_LIGHTING_LEVELS     11758 non-null  float64
 27  GRAFFITI                     11721 non-null  float64
 28  EXTERIOR_CLADDING            11751 non-null  float64
 29  EXTERIOR_GROUNDS             11745 non-null  float64
 30  EXTERIOR_WALKWAYS            11754 non-null  float64
 31  BALCONY_GUARDS               7973 non-null   float64
 32  WATER_PEN_EXT_BLDG_ELEMENTS  11754 non-null  float64
 33  PARKING_AREA                 10704 non-null  float64
 34  OTHER_FACILITIES             2254 non-null   float64
 35  GRID                         11760 non-null  object 
 36  LATITUDE                     11533 non-null  float64
 37  LONGITUDE                    11533 non-null  float64
 38  X                            11671 non-null  float64
 39  Y                            11671 non-null  float64
dtypes: float64(27), int64(7), object(6)
memory usage: 3.6+ MB

We can recognize that they have NA in some columns and the type of some columns should be adjusted. The Float type columns will change to Int after clean NA rows.

In [8]:
df['EVALUATION_COMPLETED_ON'] = pd.to_datetime(df['EVALUATION_COMPLETED_ON']).dt.year

Change to year because it is more efficient to represent only the year rather than the exact date evaluated.

In [9]:
df.duplicated().sum()
Out[9]:
0

And there is not duplicated row each one as unique.

In [10]:
for i in df.columns:
    print('number of distinct in', i)
    print(':', df[i].nunique())
number of distinct in _id
: 11760
number of distinct in RSN
: 3513
number of distinct in YEAR_REGISTERED
: 7
number of distinct in YEAR_EVALUATED
: 5
number of distinct in YEAR_BUILT
: 130
number of distinct in PROPERTY_TYPE
: 3
number of distinct in WARD
: 25
number of distinct in WARDNAME
: 25
number of distinct in SITE_ADDRESS
: 3513
number of distinct in CONFIRMED_STOREYS
: 40
number of distinct in CONFIRMED_UNITS
: 383
number of distinct in EVALUATION_COMPLETED_ON
: 7
number of distinct in SCORE
: 66
number of distinct in RESULTS_OF_SCORE
: 4
number of distinct in NO_OF_AREAS_EVALUATED
: 11
number of distinct in ENTRANCE_LOBBY
: 5
number of distinct in ENTRANCE_DOORS_WINDOWS
: 5
number of distinct in SECURITY
: 5
number of distinct in STAIRWELLS
: 5
number of distinct in LAUNDRY_ROOMS
: 5
number of distinct in INTERNAL_GUARDS_HANDRAILS
: 5
number of distinct in GARBAGE_CHUTE_ROOMS
: 5
number of distinct in GARBAGE_BIN_STORAGE_AREA
: 5
number of distinct in ELEVATORS
: 5
number of distinct in STORAGE_AREAS_LOCKERS
: 5
number of distinct in INTERIOR_WALL_CEILING_FLOOR
: 5
number of distinct in INTERIOR_LIGHTING_LEVELS
: 5
number of distinct in GRAFFITI
: 5
number of distinct in EXTERIOR_CLADDING
: 5
number of distinct in EXTERIOR_GROUNDS
: 5
number of distinct in EXTERIOR_WALKWAYS
: 5
number of distinct in BALCONY_GUARDS
: 5
number of distinct in WATER_PEN_EXT_BLDG_ELEMENTS
: 5
number of distinct in PARKING_AREA
: 5
number of distinct in OTHER_FACILITIES
: 5
number of distinct in GRID
: 327
number of distinct in LATITUDE
: 3414
number of distinct in LONGITUDE
: 3414
number of distinct in X
: 3473
number of distinct in Y
: 3473

Most of the columns are scored from 1 to 5. This means that the columns with scores are the main sources used for judgment.

In [11]:
# There are four following results as a result of the evaluation score.

df['RESULTS_OF_SCORE'].value_counts()
Out[11]:
RESULTS_OF_SCORE
Evaluation needs to be conducted in 2 years    7396
Evaluation needs to be conducted in 1 year     2619
Evaluation needs to be conducted in 3 years    1628
Building Audit                                  117
Name: count, dtype: int64

As a target 'SCORE' shows the results in score, and 'RESULTS_OF_SCORE' tells the results of the revaluation period that it receives.

Clean Data¶

When we check our data, we saw the nessesity to clean. Let's clean.

In [12]:
df.isna().sum()/df.shape[0]*100
Out[12]:
_id                             0.000000
RSN                             0.000000
YEAR_REGISTERED                 2.593537
YEAR_EVALUATED                 17.083333
YEAR_BUILT                      0.391156
PROPERTY_TYPE                   0.000000
WARD                            0.000000
WARDNAME                        0.000000
SITE_ADDRESS                    0.000000
CONFIRMED_STOREYS               0.000000
CONFIRMED_UNITS                 0.000000
EVALUATION_COMPLETED_ON         0.000000
SCORE                           0.000000
RESULTS_OF_SCORE                0.000000
NO_OF_AREAS_EVALUATED           0.000000
ENTRANCE_LOBBY                  0.017007
ENTRANCE_DOORS_WINDOWS          0.008503
SECURITY                        0.051020
STAIRWELLS                      0.025510
LAUNDRY_ROOMS                   5.578231
INTERNAL_GUARDS_HANDRAILS       0.025510
GARBAGE_CHUTE_ROOMS            56.615646
GARBAGE_BIN_STORAGE_AREA        0.093537
ELEVATORS                      41.352041
STORAGE_AREAS_LOCKERS          59.413265
INTERIOR_WALL_CEILING_FLOOR     0.017007
INTERIOR_LIGHTING_LEVELS        0.017007
GRAFFITI                        0.331633
EXTERIOR_CLADDING               0.076531
EXTERIOR_GROUNDS                0.127551
EXTERIOR_WALKWAYS               0.051020
BALCONY_GUARDS                 32.202381
WATER_PEN_EXT_BLDG_ELEMENTS     0.051020
PARKING_AREA                    8.979592
OTHER_FACILITIES               80.833333
GRID                            0.000000
LATITUDE                        1.930272
LONGITUDE                       1.930272
X                               0.756803
Y                               0.756803
dtype: float64

As we can see,

  1. we can see that we have 17% of null in 'YEAR_EVALUATED' however the year of this columns is presented in 'EVALUATION_COMPLETED_ON'.
  2. 'GARBAGE_CHUTE_ROOMS'(56.62%), 'ELEVATORS'(41.35%) 'STORAGE_AREAS_LOCKERS'(59.41%), and 'OTHER_FACILITIES'(80.83%) contains missing values over 40%.
  3. 'X' and 'Y' columns represent same informations as the 'LATITUDE' and 'LONGTITUDE'
  4. 'WARDNAME' present same information with 'WARD'.
  5. 'GRID', 'SITE_ADDRESS' are location data however we already have lon and lat.
  6. '_id' and 'RSA' are columns for identification.

It mean we can drop those columns because cleaned dataset make more accurate analysis.

In [13]:
df_clean = df.drop(['RSN','WARD','SITE_ADDRESS','YEAR_EVALUATED','X','Y','GARBAGE_CHUTE_ROOMS', 'ELEVATORS', 'STORAGE_AREAS_LOCKERS', 'OTHER_FACILITIES'], axis=1)
In [14]:
# check again
df_clean.isna().sum()/df_clean.shape[0]*100
Out[14]:
_id                             0.000000
YEAR_REGISTERED                 2.593537
YEAR_BUILT                      0.391156
PROPERTY_TYPE                   0.000000
WARDNAME                        0.000000
CONFIRMED_STOREYS               0.000000
CONFIRMED_UNITS                 0.000000
EVALUATION_COMPLETED_ON         0.000000
SCORE                           0.000000
RESULTS_OF_SCORE                0.000000
NO_OF_AREAS_EVALUATED           0.000000
ENTRANCE_LOBBY                  0.017007
ENTRANCE_DOORS_WINDOWS          0.008503
SECURITY                        0.051020
STAIRWELLS                      0.025510
LAUNDRY_ROOMS                   5.578231
INTERNAL_GUARDS_HANDRAILS       0.025510
GARBAGE_BIN_STORAGE_AREA        0.093537
INTERIOR_WALL_CEILING_FLOOR     0.017007
INTERIOR_LIGHTING_LEVELS        0.017007
GRAFFITI                        0.331633
EXTERIOR_CLADDING               0.076531
EXTERIOR_GROUNDS                0.127551
EXTERIOR_WALKWAYS               0.051020
BALCONY_GUARDS                 32.202381
WATER_PEN_EXT_BLDG_ELEMENTS     0.051020
PARKING_AREA                    8.979592
GRID                            0.000000
LATITUDE                        1.930272
LONGITUDE                       1.930272
dtype: float64

The rest of the NA will be replaced in a different way. Firstly, It's better to look at the items evaluated from 1 to 5.

In [15]:
# Columns evaluated in five digits.

df_scored = df_clean[['ENTRANCE_LOBBY',               
'ENTRANCE_DOORS_WINDOWS',       
'SECURITY',                     
'STAIRWELLS',                   
'LAUNDRY_ROOMS',                
'INTERNAL_GUARDS_HANDRAILS',    
'GARBAGE_BIN_STORAGE_AREA',                        
'INTERIOR_WALL_CEILING_FLOOR',  
'INTERIOR_LIGHTING_LEVELS',     
'GRAFFITI',                     
'EXTERIOR_CLADDING',            
'EXTERIOR_GROUNDS',             
'EXTERIOR_WALKWAYS',            
'BALCONY_GUARDS',               
'WATER_PEN_EXT_BLDG_ELEMENTS',  
'PARKING_AREA']]
In [16]:
df_scored.describe()
Out[16]:
ENTRANCE_LOBBY ENTRANCE_DOORS_WINDOWS SECURITY STAIRWELLS LAUNDRY_ROOMS INTERNAL_GUARDS_HANDRAILS GARBAGE_BIN_STORAGE_AREA INTERIOR_WALL_CEILING_FLOOR INTERIOR_LIGHTING_LEVELS GRAFFITI EXTERIOR_CLADDING EXTERIOR_GROUNDS EXTERIOR_WALKWAYS BALCONY_GUARDS WATER_PEN_EXT_BLDG_ELEMENTS PARKING_AREA
count 11758.000000 11759.000000 11754.000000 11757.000000 11104.000000 11757.000000 11749.000000 11758.000000 11758.000000 11721.000000 11751.000000 11745.000000 11754.000000 7973.000000 11754.000000 10704.000000
mean 3.713642 3.675313 4.126425 3.453857 3.575919 3.603640 3.607201 3.492686 3.672393 4.610869 3.549060 3.650575 3.643866 3.752665 3.668453 3.392096
std 0.775948 0.770057 0.877997 0.787374 0.794015 0.830116 0.782764 0.767906 0.878231 0.755874 0.718478 0.754074 0.744887 0.833194 0.739714 0.757125
min 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
25% 3.000000 3.000000 3.000000 3.000000 3.000000 3.000000 3.000000 3.000000 3.000000 4.000000 3.000000 3.000000 3.000000 3.000000 3.000000 3.000000
50% 4.000000 4.000000 4.000000 3.000000 4.000000 4.000000 4.000000 3.000000 4.000000 5.000000 4.000000 4.000000 4.000000 4.000000 4.000000 3.000000
75% 4.000000 4.000000 5.000000 4.000000 4.000000 4.000000 4.000000 4.000000 4.000000 5.000000 4.000000 4.000000 4.000000 4.000000 4.000000 4.000000
max 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000

Most of scores are over 3, but we should care the scores under 3 because we should find an environment where we get low scores, result is the audit.

32.20% of the value of 'BALCONY_GUARDS', 5.58% of the value of 'LAUNDRY_ROOMS', and 8.98% of the value of 'PARKING_AREA' are missing. We will replace those with the mode

Additionaly, All other rows with missing values will be dropped.

In [17]:
most_freq1 = df_clean['BALCONY_GUARDS'].value_counts().idxmax()
most_freq2 = df_clean['LAUNDRY_ROOMS'].value_counts().idxmax()
most_freq3 = df_clean['PARKING_AREA'].value_counts().idxmax()

df_clean['BALCONY_GUARDS'].fillna(most_freq1, inplace = True)
df_clean['LAUNDRY_ROOMS'].fillna(most_freq2, inplace = True)
df_clean['PARKING_AREA'].fillna(most_freq3, inplace = True)
In [18]:
df_clean.dropna(inplace=True)
In [19]:
# Clean data sanity check
df_clean.isna().sum()/df_clean.shape[0]*100
Out[19]:
_id                            0.0
YEAR_REGISTERED                0.0
YEAR_BUILT                     0.0
PROPERTY_TYPE                  0.0
WARDNAME                       0.0
CONFIRMED_STOREYS              0.0
CONFIRMED_UNITS                0.0
EVALUATION_COMPLETED_ON        0.0
SCORE                          0.0
RESULTS_OF_SCORE               0.0
NO_OF_AREAS_EVALUATED          0.0
ENTRANCE_LOBBY                 0.0
ENTRANCE_DOORS_WINDOWS         0.0
SECURITY                       0.0
STAIRWELLS                     0.0
LAUNDRY_ROOMS                  0.0
INTERNAL_GUARDS_HANDRAILS      0.0
GARBAGE_BIN_STORAGE_AREA       0.0
INTERIOR_WALL_CEILING_FLOOR    0.0
INTERIOR_LIGHTING_LEVELS       0.0
GRAFFITI                       0.0
EXTERIOR_CLADDING              0.0
EXTERIOR_GROUNDS               0.0
EXTERIOR_WALKWAYS              0.0
BALCONY_GUARDS                 0.0
WATER_PEN_EXT_BLDG_ELEMENTS    0.0
PARKING_AREA                   0.0
GRID                           0.0
LATITUDE                       0.0
LONGITUDE                      0.0
dtype: float64
In [20]:
float_columns = [
    'YEAR_REGISTERED', 
    'YEAR_BUILT',
    'ENTRANCE_LOBBY',
    'ENTRANCE_DOORS_WINDOWS',
    'SECURITY',
    'STAIRWELLS',
    'LAUNDRY_ROOMS',
    'INTERNAL_GUARDS_HANDRAILS',
    'GARBAGE_BIN_STORAGE_AREA',
    'INTERIOR_WALL_CEILING_FLOOR',
    'INTERIOR_LIGHTING_LEVELS',
    'GRAFFITI',
    'EXTERIOR_CLADDING',
    'EXTERIOR_GROUNDS',
    'EXTERIOR_WALKWAYS',
    'BALCONY_GUARDS',
    'WATER_PEN_EXT_BLDG_ELEMENTS',
    'PARKING_AREA']
df_clean[float_columns] = df_clean[float_columns].astype(int)

Now that there is no NA, we will fix the data type.

In [21]:
df_clean.info()
<class 'pandas.core.frame.DataFrame'>
Index: 11152 entries, 1 to 11759
Data columns (total 30 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   _id                          11152 non-null  int64  
 1   YEAR_REGISTERED              11152 non-null  int64  
 2   YEAR_BUILT                   11152 non-null  int64  
 3   PROPERTY_TYPE                11152 non-null  object 
 4   WARDNAME                     11152 non-null  object 
 5   CONFIRMED_STOREYS            11152 non-null  int64  
 6   CONFIRMED_UNITS              11152 non-null  int64  
 7   EVALUATION_COMPLETED_ON      11152 non-null  int32  
 8   SCORE                        11152 non-null  int64  
 9   RESULTS_OF_SCORE             11152 non-null  object 
 10  NO_OF_AREAS_EVALUATED        11152 non-null  int64  
 11  ENTRANCE_LOBBY               11152 non-null  int64  
 12  ENTRANCE_DOORS_WINDOWS       11152 non-null  int64  
 13  SECURITY                     11152 non-null  int64  
 14  STAIRWELLS                   11152 non-null  int64  
 15  LAUNDRY_ROOMS                11152 non-null  int64  
 16  INTERNAL_GUARDS_HANDRAILS    11152 non-null  int64  
 17  GARBAGE_BIN_STORAGE_AREA     11152 non-null  int64  
 18  INTERIOR_WALL_CEILING_FLOOR  11152 non-null  int64  
 19  INTERIOR_LIGHTING_LEVELS     11152 non-null  int64  
 20  GRAFFITI                     11152 non-null  int64  
 21  EXTERIOR_CLADDING            11152 non-null  int64  
 22  EXTERIOR_GROUNDS             11152 non-null  int64  
 23  EXTERIOR_WALKWAYS            11152 non-null  int64  
 24  BALCONY_GUARDS               11152 non-null  int64  
 25  WATER_PEN_EXT_BLDG_ELEMENTS  11152 non-null  int64  
 26  PARKING_AREA                 11152 non-null  int64  
 27  GRID                         11152 non-null  object 
 28  LATITUDE                     11152 non-null  float64
 29  LONGITUDE                    11152 non-null  float64
dtypes: float64(2), int32(1), int64(23), object(4)
memory usage: 2.6+ MB

I'm finally done cleaning!

Data Cleaning (Summary):

  1. The dataset consists of 11,760 rows and 40 columns.
  2. There are no duplicate rows, but some columns contain duplicated information.
  3. Duplicate columns were removed during the data cleaning process.
  4. Columns with a missing value of 50% or more were deleted.
  5. some columns have similar information so those were deleted.

visualization¶

We will do visualization to look around the data efficiently.

In [22]:
import matplotlib.pyplot as plt

Columns were classified before running EDA. This is because using 40 rows for analysis at once is inefficient and cannot guarantee accurate results.

  1. About building Information
  2. About Maintenance (Exterior, Interior, and Common Area)

Of course, hygiene or additional facilities can also be indirectly important. But we want to look at more direct factors for safety.

Relationship between basic apartment information and score, not evaluation score

  • Numerical data - continuous

    • 'YEAR_REGISTERED',
    • 'YEAR_BUILT'
  • Numerical data - discrete

    • 'CONFIRMED_STOREYS',
    • 'CONFIRMED_UNITS',
    • 'NO_OF_AREAS_EVALUATED'
  • Categorical - nominal

    • 'PROPERTY_TYPE',
    • 'WARDNAME'
  • target Variable

    • 'SCORE'(Numerical target),
    • 'RESULTS_OF_SCORE'(Categorical target)
In [23]:
from scipy import stats
import statsmodels.api as sm

Y = df_clean['SCORE'].values

for i in ['YEAR_REGISTERED', 'YEAR_BUILT', 'CONFIRMED_STOREYS', 'CONFIRMED_UNITS', 'NO_OF_AREAS_EVALUATED']:
    print('*', i)
    X=df_clean[i].values
    print('Correlation:', stats.pearsonr(X,Y)[0])
    print('P-value:', stats.pearsonr(X,Y)[1])
    print('\n')
* YEAR_REGISTERED
Correlation: -0.04694172099864394
P-value: 7.072356943071497e-07


* YEAR_BUILT
Correlation: 0.16873709868865364
P-value: 5.160153615173216e-72


* CONFIRMED_STOREYS
Correlation: 0.12568179491929196
P-value: 1.6842857551320354e-40


* CONFIRMED_UNITS
Correlation: 0.09936299669697163
P-value: 7.145319371979502e-26


* NO_OF_AREAS_EVALUATED
Correlation: 0.23153736362899446
P-value: 1.274036340575308e-135


Each of the P-value of the columns related to apartment information is significantly small. It means it is worth

In [24]:
plt.figure()

df_clean.boxplot(column=['SCORE'], by=['YEAR_REGISTERED'])

plt.show()
<Figure size 640x480 with 0 Axes>

Looking at the boxplot of the scores by year of registration, the median value is similar, but the older the apartment is registered, the more outliers the lower the score. Data for 2023 looks different because it is not yet a year old.

In [25]:
plt.figure()
Y = df_clean['SCORE'].values
X = df_clean['YEAR_BUILT'].values
plt.scatter(X, Y, alpha=0.3)
plt.title('YEAR_BUILT')
plt.show()

Although it is difficult to find any noticeable characteristics, it can be seen that the score of buildings built around the 2000s is less than 60 points compared to those around the 1950s.

In [26]:
Y = df_clean['SCORE'].values
X = df_clean['CONFIRMED_STOREYS'].values
plt.scatter(X, Y, alpha=0.5)
plt.title('CONFIRMED_STOREYS')
plt.show()
In [27]:
Y = df_clean['SCORE'].values
X = df_clean['CONFIRMED_UNITS'].values
plt.scatter(X, Y, alpha=0.5)
plt.title('CONFIRMED_UNITS')
plt.show()

It can be seen that the larger the number of units and storeys, the narrower the radius of the score distribution. Also, the lower the number, the lower the score.

In [28]:
df_clean.boxplot(column=['SCORE'], by=['NO_OF_AREAS_EVALUATED'])
Out[28]:
<Axes: title={'center': 'SCORE'}, xlabel='[NO_OF_AREAS_EVALUATED]'>

Let's look at the category data.

  • 'PROPERTY_TYPE'
  • 'WARDNAME'
In [29]:
plt.figure()
plt.title('PROPERTY_TYPE')

property_counts = df_clean['PROPERTY_TYPE'].value_counts()

ax = df_clean['PROPERTY_TYPE'].value_counts().plot.bar()

for i, count in enumerate(property_counts):
    ax.annotate(str(count), xy=(i, count), ha='center', va='bottom')
    
plt.show()

Most buildings are privately owned.

In [30]:
plt.figure(figsize = (12,6))
plt.title('WARDNAME')

wardname_counts = df_clean['WARDNAME'].value_counts()

ax = df_clean['WARDNAME'].value_counts().plot.bar()

for i, count in enumerate(wardname_counts):
    ax.annotate(str(count), xy=(i, count), ha='center', va='bottom')
#plt.savefig('WARDNAME.png')
plt.show()

"Toronto-St. Paul's" has the most buildings, followed by "Eglinton-Lawrence" and "Etobicoke-Lakeshore".

Another package is needed for map visualization.

In [31]:
!pip install folium
Requirement already satisfied: folium in /Users/jaysworld/anaconda3/lib/python3.10/site-packages (0.14.0)
Requirement already satisfied: numpy in /Users/jaysworld/anaconda3/lib/python3.10/site-packages (from folium) (1.23.5)
Requirement already satisfied: jinja2>=2.9 in /Users/jaysworld/anaconda3/lib/python3.10/site-packages (from folium) (3.1.2)
Requirement already satisfied: requests in /Users/jaysworld/anaconda3/lib/python3.10/site-packages (from folium) (2.28.1)
Requirement already satisfied: branca>=0.6.0 in /Users/jaysworld/anaconda3/lib/python3.10/site-packages (from folium) (0.6.0)
Requirement already satisfied: MarkupSafe>=2.0 in /Users/jaysworld/anaconda3/lib/python3.10/site-packages (from jinja2>=2.9->folium) (2.1.1)
Requirement already satisfied: idna<4,>=2.5 in /Users/jaysworld/anaconda3/lib/python3.10/site-packages (from requests->folium) (3.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /Users/jaysworld/anaconda3/lib/python3.10/site-packages (from requests->folium) (1.26.14)
Requirement already satisfied: charset-normalizer<3,>=2 in /Users/jaysworld/anaconda3/lib/python3.10/site-packages (from requests->folium) (2.0.4)
Requirement already satisfied: certifi>=2017.4.17 in /Users/jaysworld/anaconda3/lib/python3.10/site-packages (from requests->folium) (2023.7.22)
In [32]:
import folium
In [33]:
# Latitude of Toronto
latitude = 43.651070

# Longtitude of Toronto
longitude = -79.347015

# the points of apartments
location_score = df_clean[['LATITUDE', 'LONGITUDE','SCORE']]
In [34]:
from folium.plugins import MarkerCluster

m = folium.Map(location=[latitude, longitude],
              zoom_start=13,
              width=750,
              height=500
              )

# CN Tower location
folium.Marker([43.642567, -79.387054],
             popup='CN Tower',
             tooltip='Landmark of Toronto').add_to(m)

location = df_clean[['LATITUDE', 'LONGITUDE']]

marker_cluster = MarkerCluster().add_to(m)

for lat, long in zip(location['LATITUDE'], location['LONGITUDE']):
    folium.Marker([lat, long], icon = folium.Icon(color="green")).add_to(marker_cluster)

# m.save('zoom.html')
    
m
Out[34]:
Make this Notebook Trusted to load map: File -> Trust Notebook
In [35]:
from folium.plugins import HeatMap

m = folium.Map(location=[latitude, longitude],
              zoom_start=13,
              width=750,
              height=500
              )

# CN Tower location
folium.Marker([43.642567, -79.387054],
             popup='CN Tower',
             tooltip='Landmark of Toronto').add_to(m)

location = df_clean[['LATITUDE', 'LONGITUDE']]

data = location.values.tolist()

heatmap = HeatMap(data,
                 min_opacity=0.05, 
                max_opacity=0.9, 
                radius=25)

heatmap.add_to(m)

# m.save('heatmap.html')

m
Out[35]:
Make this Notebook Trusted to load map: File -> Trust Notebook

This time we will look at the EDA for the items evaluated.

In [36]:
subset_columns = ['ENTRANCE_LOBBY', 'ENTRANCE_DOORS_WINDOWS', 'SECURITY', 'STAIRWELLS', 'LAUNDRY_ROOMS', 
                  'INTERNAL_GUARDS_HANDRAILS', 'GARBAGE_BIN_STORAGE_AREA', 'INTERIOR_WALL_CEILING_FLOOR', 
                  'INTERIOR_LIGHTING_LEVELS', 'GRAFFITI', 'EXTERIOR_CLADDING', 'EXTERIOR_GROUNDS', 
                  'EXTERIOR_WALKWAYS', 'BALCONY_GUARDS', 'WATER_PEN_EXT_BLDG_ELEMENTS', 'PARKING_AREA']

subset_df = df_clean[subset_columns]

sns.set(style="whitegrid")

f, axes = plt.subplots(4, 4, figsize=(15, 12))
axes = axes.ravel()

# Plot histograms for each column
for i, col in enumerate(subset_columns):
    sns.histplot(data=subset_df, x=col, ax=axes[i], kde=True)
    axes[i].set_title(col)
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Frequency')

plt.tight_layout()

plt.show()

Modeling¶

  1. Linear Regression Models
    1. Linear Regression in linear_model
    2. Linear Regression in statsmodels
    3. RandomForestRegression
  2. Classificaation Models
    1. Logistic Regression
    2. Decision Tree
    3. K-Nearest Neighbor

Linear Regression Models¶

The name will be made starting with a unit of 00.

LinearRegression in linear_model¶

In [37]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In this model, the variable name is 1

In [38]:
X1 = df_clean[['YEAR_REGISTERED', 'YEAR_BUILT', 'CONFIRMED_STOREYS', 'CONFIRMED_UNITS', 'NO_OF_AREAS_EVALUATED']]
y1 = df_clean['SCORE']
In [39]:
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1, y1, test_size=0.2, random_state=42)
In [40]:
#model
model1 = LinearRegression()

#fit
model1.fit(X_train1, y_train1)

#predict
y_pred1 = model1.predict(X_test1)
y_pred1 
Out[40]:
array([71.09875189, 76.2080582 , 75.62945901, ..., 69.52470475,
       73.78822268, 74.07809092])
In [41]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

# Calculate the Mean Absolute Error
mae_pred1 = mean_absolute_error(y_test1, y_pred1) 

# Calculate the Mean Squared Error
mse1 = mean_squared_error(y_pred1, y_test1)

# Calculate the Root Mean Square Error
rmse1 = mean_squared_error(y_test1, y_pred1, squared=False)

print('MAE:',mae_pred1)
print('MSE:',mse1)
print('RMSE:',rmse1)
MAE: 8.158775554020059
MSE: 100.2624994530014
RMSE: 10.013116370691064

The MAE obtained the average of the errors of the predicted value is 8, which seems to be a small number in SCORE, where 100 is the highest. However, considering that the width of most score distributions is small, it is difficult to judge that it is accurate. Let's create more diverse models.

In [42]:
plt.figure()

plt.scatter(y_test1, y_pred1, alpha=0.4)
plt.xlabel('Actual Score')
plt.ylabel('Predicted Score')
plt.title('Multiple Linear Regression')

plt.show()

The above chart shows the prediction and actual data of linear regression analysis with numerical data. Unfortunately, the prediction is not successful because it does not have a linear shape.

Linear Regression in statsmodels¶

In [43]:
import statsmodels.api as sm

In the df_clean, we have columns of identification, location information. However, they are not required for linear regression models, so we will only use what we need in the model.

In this model, the variable name is 2

In [44]:
# They are columns that have no redundancy and have unique values used for evaluation.
eva_columns2 = df_clean[['YEAR_REGISTERED',               
'YEAR_BUILT',                    
'CONFIRMED_STOREYS',             
'CONFIRMED_UNITS',               
'NO_OF_AREAS_EVALUATED',         
'ENTRANCE_LOBBY',                
'ENTRANCE_DOORS_WINDOWS',        
'SECURITY',                      
'STAIRWELLS',                    
'LAUNDRY_ROOMS',                 
'INTERNAL_GUARDS_HANDRAILS',     
'GARBAGE_BIN_STORAGE_AREA',      
'INTERIOR_WALL_CEILING_FLOOR',   
'INTERIOR_LIGHTING_LEVELS',      
'GRAFFITI',                      
'EXTERIOR_CLADDING',             
'EXTERIOR_GROUNDS',              
'EXTERIOR_WALKWAYS',             
'BALCONY_GUARDS',               
'WATER_PEN_EXT_BLDG_ELEMENTS',   
'PARKING_AREA']]
In [45]:
X2 = eva_columns2
y2 = df_clean['SCORE']
In [46]:
X2_withconstant2 = sm.add_constant(X2)
In [47]:
# 1. Instantiate Model
myregression2 = sm.OLS(y2, X2_withconstant2)

# 2. Fit Model (this returns a seperate object with the parameters)
myregression_results2 = myregression2.fit()

# Looking at the summary
myregression_results2.summary()
Out[47]:
OLS Regression Results
Dep. Variable: SCORE R-squared: 0.989
Model: OLS Adj. R-squared: 0.989
Method: Least Squares F-statistic: 4.938e+04
Date: Wed, 06 Sep 2023 Prob (F-statistic): 0.00
Time: 21:24:40 Log-Likelihood: -16586.
No. Observations: 11152 AIC: 3.322e+04
Df Residuals: 11130 BIC: 3.338e+04
Df Model: 21
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -59.6513 35.145 -1.697 0.090 -128.542 9.240
YEAR_REGISTERED 0.0274 0.017 1.573 0.116 -0.007 0.061
YEAR_BUILT 0.0028 0.001 4.198 0.000 0.001 0.004
CONFIRMED_STOREYS 0.0102 0.003 2.986 0.003 0.004 0.017
CONFIRMED_UNITS -1.821e-05 0.000 -0.082 0.934 -0.000 0.000
NO_OF_AREAS_EVALUATED -0.1220 0.009 -13.913 0.000 -0.139 -0.105
ENTRANCE_LOBBY 1.4649 0.020 72.374 0.000 1.425 1.505
ENTRANCE_DOORS_WINDOWS 1.2556 0.019 67.197 0.000 1.219 1.292
SECURITY 1.2852 0.015 85.412 0.000 1.256 1.315
STAIRWELLS 1.3693 0.019 73.234 0.000 1.333 1.406
LAUNDRY_ROOMS 1.3357 0.017 76.329 0.000 1.301 1.370
INTERNAL_GUARDS_HANDRAILS 1.3049 0.015 86.773 0.000 1.275 1.334
GARBAGE_BIN_STORAGE_AREA 1.3553 0.016 83.072 0.000 1.323 1.387
INTERIOR_WALL_CEILING_FLOOR 1.3242 0.019 71.316 0.000 1.288 1.361
INTERIOR_LIGHTING_LEVELS 1.3151 0.016 84.056 0.000 1.284 1.346
GRAFFITI 1.1988 0.015 80.734 0.000 1.170 1.228
EXTERIOR_CLADDING 1.2485 0.020 63.164 0.000 1.210 1.287
EXTERIOR_GROUNDS 1.3542 0.019 70.484 0.000 1.317 1.392
EXTERIOR_WALKWAYS 1.2181 0.019 65.329 0.000 1.182 1.255
BALCONY_GUARDS 0.9270 0.017 54.618 0.000 0.894 0.960
WATER_PEN_EXT_BLDG_ELEMENTS 1.2169 0.018 66.434 0.000 1.181 1.253
PARKING_AREA 1.0911 0.016 66.976 0.000 1.059 1.123
Omnibus: 253.878 Durbin-Watson: 1.864
Prob(Omnibus): 0.000 Jarque-Bera (JB): 579.714
Skew: -0.032 Prob(JB): 1.31e-126
Kurtosis: 4.115 Cond. No. 9.75e+06


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.75e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

as you can see, the p-value of 'CONFIRMED_UNIT' and 'YEAR_REGISTERED' are over 0.05 therfore we will drop it.

In [48]:
eva_columns3 = df_clean[[              
'YEAR_BUILT',                    
'CONFIRMED_STOREYS',             
'NO_OF_AREAS_EVALUATED',         
'ENTRANCE_LOBBY',                
'ENTRANCE_DOORS_WINDOWS',        
'SECURITY',                      
'STAIRWELLS',                    
'LAUNDRY_ROOMS',                 
'INTERNAL_GUARDS_HANDRAILS',     
'GARBAGE_BIN_STORAGE_AREA',      
'INTERIOR_WALL_CEILING_FLOOR',   
'INTERIOR_LIGHTING_LEVELS',      
'GRAFFITI',                      
'EXTERIOR_CLADDING',             
'EXTERIOR_GROUNDS',              
'EXTERIOR_WALKWAYS',             
'BALCONY_GUARDS',               
'WATER_PEN_EXT_BLDG_ELEMENTS',   
'PARKING_AREA']]

In this model, the variable name is 3

In [49]:
X3 = eva_columns3
y3 = df_clean['SCORE']
In [50]:
X3_withconstant3 = sm.add_constant(X3)
In [51]:
# 1. Instantiate Model
myregression3 = sm.OLS(y3, X3_withconstant3)

# 2. Fit Model (this returns a seperate object with the parameters)
myregression_results3 = myregression3.fit()

# Looking at the summary
myregression_results3.summary()
Out[51]:
OLS Regression Results
Dep. Variable: SCORE R-squared: 0.989
Model: OLS Adj. R-squared: 0.989
Method: Least Squares F-statistic: 5.457e+04
Date: Wed, 06 Sep 2023 Prob (F-statistic): 0.00
Time: 21:24:40 Log-Likelihood: -16588.
No. Observations: 11152 AIC: 3.322e+04
Df Residuals: 11132 BIC: 3.336e+04
Df Model: 19
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -4.4001 1.236 -3.560 0.000 -6.823 -1.977
YEAR_BUILT 0.0027 0.001 4.177 0.000 0.001 0.004
CONFIRMED_STOREYS 0.0100 0.002 4.691 0.000 0.006 0.014
NO_OF_AREAS_EVALUATED -0.1235 0.009 -14.311 0.000 -0.140 -0.107
ENTRANCE_LOBBY 1.4657 0.020 72.435 0.000 1.426 1.505
ENTRANCE_DOORS_WINDOWS 1.2556 0.019 67.198 0.000 1.219 1.292
SECURITY 1.2847 0.015 85.402 0.000 1.255 1.314
STAIRWELLS 1.3694 0.019 73.241 0.000 1.333 1.406
LAUNDRY_ROOMS 1.3344 0.017 76.349 0.000 1.300 1.369
INTERNAL_GUARDS_HANDRAILS 1.3060 0.015 86.960 0.000 1.277 1.335
GARBAGE_BIN_STORAGE_AREA 1.3557 0.016 83.119 0.000 1.324 1.388
INTERIOR_WALL_CEILING_FLOOR 1.3236 0.019 71.359 0.000 1.287 1.360
INTERIOR_LIGHTING_LEVELS 1.3153 0.016 84.117 0.000 1.285 1.346
GRAFFITI 1.1984 0.015 80.981 0.000 1.169 1.227
EXTERIOR_CLADDING 1.2486 0.020 63.166 0.000 1.210 1.287
EXTERIOR_GROUNDS 1.3542 0.019 70.511 0.000 1.317 1.392
EXTERIOR_WALKWAYS 1.2177 0.019 65.312 0.000 1.181 1.254
BALCONY_GUARDS 0.9273 0.017 54.640 0.000 0.894 0.961
WATER_PEN_EXT_BLDG_ELEMENTS 1.2169 0.018 66.432 0.000 1.181 1.253
PARKING_AREA 1.0909 0.016 66.971 0.000 1.059 1.123
Omnibus: 254.512 Durbin-Watson: 1.863
Prob(Omnibus): 0.000 Jarque-Bera (JB): 582.448
Skew: -0.030 Prob(JB): 3.34e-127
Kurtosis: 4.118 Cond. No. 2.39e+05


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.39e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

We thought about the combination of various variables. This time, let's check the degree of correlation and continue.

In [52]:
X3.corr()
Out[52]:
YEAR_BUILT CONFIRMED_STOREYS NO_OF_AREAS_EVALUATED ENTRANCE_LOBBY ENTRANCE_DOORS_WINDOWS SECURITY STAIRWELLS LAUNDRY_ROOMS INTERNAL_GUARDS_HANDRAILS GARBAGE_BIN_STORAGE_AREA INTERIOR_WALL_CEILING_FLOOR INTERIOR_LIGHTING_LEVELS GRAFFITI EXTERIOR_CLADDING EXTERIOR_GROUNDS EXTERIOR_WALKWAYS BALCONY_GUARDS WATER_PEN_EXT_BLDG_ELEMENTS PARKING_AREA
YEAR_BUILT 1.000000 0.366900 0.522412 0.171732 0.120779 0.090139 0.076234 0.150829 0.184785 0.063703 0.067745 0.135345 -0.026563 0.211636 0.108594 0.140493 0.028208 0.155352 0.115760
CONFIRMED_STOREYS 0.366900 1.000000 0.593097 0.219177 0.131271 0.120292 0.006160 0.157234 0.153452 0.044288 0.030435 0.142126 -0.119631 0.117091 0.088737 0.098031 0.027771 0.075040 0.007841
NO_OF_AREAS_EVALUATED 0.522412 0.593097 1.000000 0.304457 0.186586 0.179652 0.136259 0.275323 0.233998 0.116819 0.110421 0.215188 -0.011284 0.173537 0.157517 0.161837 0.001725 0.130761 0.116301
ENTRANCE_LOBBY 0.171732 0.219177 0.304457 1.000000 0.586762 0.495384 0.562646 0.565121 0.429285 0.457104 0.545580 0.517893 0.281387 0.442738 0.527497 0.479569 0.343083 0.389781 0.323259
ENTRANCE_DOORS_WINDOWS 0.120779 0.131271 0.186586 0.586762 1.000000 0.510583 0.447198 0.458404 0.387223 0.427268 0.488562 0.497602 0.276959 0.459706 0.521023 0.492676 0.344897 0.423843 0.351325
SECURITY 0.090139 0.120292 0.179652 0.495384 0.510583 1.000000 0.396753 0.419171 0.370863 0.418373 0.408332 0.503602 0.248762 0.344039 0.435027 0.401815 0.289303 0.382347 0.292383
STAIRWELLS 0.076234 0.006160 0.136259 0.562646 0.447198 0.396753 1.000000 0.505245 0.469083 0.422580 0.599374 0.461718 0.323344 0.385585 0.475031 0.427633 0.299292 0.389476 0.357707
LAUNDRY_ROOMS 0.150829 0.157234 0.275323 0.565121 0.458404 0.419171 0.505245 1.000000 0.381326 0.417581 0.484054 0.490469 0.236123 0.389613 0.465503 0.425406 0.288619 0.351577 0.329924
INTERNAL_GUARDS_HANDRAILS 0.184785 0.153452 0.233998 0.429285 0.387223 0.370863 0.469083 0.381326 1.000000 0.340644 0.354027 0.411449 0.167570 0.337728 0.359546 0.369165 0.271873 0.362661 0.282280
GARBAGE_BIN_STORAGE_AREA 0.063703 0.044288 0.116819 0.457104 0.427268 0.418373 0.422580 0.417581 0.340644 1.000000 0.399484 0.412616 0.242217 0.374213 0.478326 0.423085 0.323684 0.353745 0.349497
INTERIOR_WALL_CEILING_FLOOR 0.067745 0.030435 0.110421 0.545580 0.488562 0.408332 0.599374 0.484054 0.354027 0.399484 1.000000 0.489153 0.319970 0.393544 0.458141 0.414459 0.303679 0.377608 0.330263
INTERIOR_LIGHTING_LEVELS 0.135345 0.142126 0.215188 0.517893 0.497602 0.503602 0.461718 0.490469 0.411449 0.412616 0.489153 1.000000 0.227596 0.388030 0.461578 0.432152 0.312306 0.394972 0.341143
GRAFFITI -0.026563 -0.119631 -0.011284 0.281387 0.276959 0.248762 0.323344 0.236123 0.167570 0.242217 0.319970 0.227596 1.000000 0.225538 0.291444 0.244344 0.179703 0.234188 0.191078
EXTERIOR_CLADDING 0.211636 0.117091 0.173537 0.442738 0.459706 0.344039 0.385585 0.389613 0.337728 0.374213 0.393544 0.388030 0.225538 1.000000 0.458361 0.459323 0.413140 0.591237 0.340671
EXTERIOR_GROUNDS 0.108594 0.088737 0.157517 0.527497 0.521023 0.435027 0.475031 0.465503 0.359546 0.478326 0.458141 0.461578 0.291444 0.458361 1.000000 0.586838 0.352748 0.426948 0.400288
EXTERIOR_WALKWAYS 0.140493 0.098031 0.161837 0.479569 0.492676 0.401815 0.427633 0.425406 0.369165 0.423085 0.414459 0.432152 0.244344 0.459323 0.586838 1.000000 0.325823 0.423368 0.399761
BALCONY_GUARDS 0.028208 0.027771 0.001725 0.343083 0.344897 0.289303 0.299292 0.288619 0.271873 0.323684 0.303679 0.312306 0.179703 0.413140 0.352748 0.325823 1.000000 0.348556 0.251993
WATER_PEN_EXT_BLDG_ELEMENTS 0.155352 0.075040 0.130761 0.389781 0.423843 0.382347 0.389476 0.351577 0.362661 0.353745 0.377608 0.394972 0.234188 0.591237 0.426948 0.423368 0.348556 1.000000 0.340585
PARKING_AREA 0.115760 0.007841 0.116301 0.323259 0.351325 0.292383 0.357707 0.329924 0.282280 0.349497 0.330263 0.341143 0.191078 0.340671 0.400288 0.399761 0.251993 0.340585 1.000000
In [53]:
# Calculate the correlation matrix
correlation_matrix3 = X3.corr()

# Set up the matplotlib figure
plt.figure(figsize=(10, 8))

# Create a heatmap using seaborn
sns.heatmap(correlation_matrix3, annot=True, cmap='coolwarm', center=0)

plt.title('Correlation Heatmap of X4 Variables')
plt.show()

In the heatmap we can observer that there are some correlations that over 0.5. However we can't drop everything so we're going to delete only one 'ENTRANCE_LOBBY' that's highly relevant to several columns and try to do it again.

In this model, the variable name is 4

In [54]:
eva_columns4 = df_clean[[              
'YEAR_BUILT',                    
'CONFIRMED_STOREYS',             
'NO_OF_AREAS_EVALUATED',         
'ENTRANCE_DOORS_WINDOWS',        
'SECURITY',                      
'STAIRWELLS',                    
'LAUNDRY_ROOMS',                 
'INTERNAL_GUARDS_HANDRAILS',     
'GARBAGE_BIN_STORAGE_AREA',      
'INTERIOR_WALL_CEILING_FLOOR',   
'INTERIOR_LIGHTING_LEVELS',      
'GRAFFITI',                      
'EXTERIOR_CLADDING',             
'EXTERIOR_GROUNDS',              
'EXTERIOR_WALKWAYS',             
'BALCONY_GUARDS',               
'WATER_PEN_EXT_BLDG_ELEMENTS',   
'PARKING_AREA']]
In [55]:
X4 = eva_columns4
y4 = df_clean['SCORE']
In [56]:
X4_withconstant4 = sm.add_constant(X4)

# 1. Instantiate Model
myregression4 = sm.OLS(y4, X4_withconstant4)

# 2. Fit Model (this returns a seperate object with the parameters)
myregression_results4 = myregression4.fit()

# Looking at the summary
myregression_results4.summary()
Out[56]:
OLS Regression Results
Dep. Variable: SCORE R-squared: 0.984
Model: OLS Adj. R-squared: 0.984
Method: Least Squares F-statistic: 3.896e+04
Date: Wed, 06 Sep 2023 Prob (F-statistic): 0.00
Time: 21:24:41 Log-Likelihood: -18741.
No. Observations: 11152 AIC: 3.752e+04
Df Residuals: 11133 BIC: 3.766e+04
Df Model: 18
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -4.2248 1.499 -2.818 0.005 -7.164 -1.286
YEAR_BUILT 0.0021 0.001 2.659 0.008 0.001 0.004
CONFIRMED_STOREYS 0.0247 0.003 9.597 0.000 0.020 0.030
NO_OF_AREAS_EVALUATED -0.0579 0.010 -5.567 0.000 -0.078 -0.038
ENTRANCE_DOORS_WINDOWS 1.5276 0.022 68.804 0.000 1.484 1.571
SECURITY 1.3880 0.018 76.414 0.000 1.352 1.424
STAIRWELLS 1.6038 0.022 71.808 0.000 1.560 1.648
LAUNDRY_ROOMS 1.5422 0.021 73.750 0.000 1.501 1.583
INTERNAL_GUARDS_HANDRAILS 1.3464 0.018 73.963 0.000 1.311 1.382
GARBAGE_BIN_STORAGE_AREA 1.4362 0.020 72.767 0.000 1.397 1.475
INTERIOR_WALL_CEILING_FLOOR 1.4963 0.022 67.063 0.000 1.453 1.540
INTERIOR_LIGHTING_LEVELS 1.3809 0.019 72.935 0.000 1.344 1.418
GRAFFITI 1.2443 0.018 69.386 0.000 1.209 1.279
EXTERIOR_CLADDING 1.3158 0.024 54.941 0.000 1.269 1.363
EXTERIOR_GROUNDS 1.4684 0.023 63.247 0.000 1.423 1.514
EXTERIOR_WALKWAYS 1.2764 0.023 56.497 0.000 1.232 1.321
BALCONY_GUARDS 0.9960 0.021 48.461 0.000 0.956 1.036
WATER_PEN_EXT_BLDG_ELEMENTS 1.1711 0.022 52.740 0.000 1.128 1.215
PARKING_AREA 1.0413 0.020 52.750 0.000 1.003 1.080
Omnibus: 238.926 Durbin-Watson: 1.828
Prob(Omnibus): 0.000 Jarque-Bera (JB): 469.268
Skew: 0.129 Prob(JB): 1.26e-102
Kurtosis: 3.972 Cond. No. 2.39e+05


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.39e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

The linear regression model has a very accurate value with an r-squared of 0.984. However, the condition number is large, 2.39e+05. This might indicate that there are strong multicollinearity.

By inferring from the results, it is estimated that the model made a formula from the evaluation indicators and calculated. That's why you have to try another model analysis.

RandomForestRegression¶

In this model, the variable name is 5

In [57]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
In [58]:
# Assign x and y data
X5 = df_clean.drop(['SCORE', 'RESULTS_OF_SCORE'], axis=1)
y5 = df_clean['SCORE']

# We have object columns like 'PROPERTY_TYPE', 'WARDNAME' , etc. 
X5_encoded = pd.get_dummies(X5, columns=['PROPERTY_TYPE', 'WARDNAME', 'GRID'])

# Split test and train data.
X5_train, X5_test, y5_train, y5_test = train_test_split(X5_encoded, y5, test_size=0.2, random_state=42)
In [59]:
rf_model5 = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model5.fit(X5_train, y5_train)

y5_pred = rf_model5.predict(X5_test)

mae5 = mean_absolute_error(y5_test, y5_pred)
mse5 = mean_squared_error(y5_test, y5_pred)
rmse5 = np.sqrt(mse5)
r2_5 = r2_score(y5_test, y5_pred)

print("Mean Abolute Error", mae5)
print("Mean Squared Error:", mse5)
print("Root Mean Squared Error:", rmse5)
print("R square:", r2_5)
Mean Abolute Error 1.5330793366203497
Mean Squared Error: 4.212218735992828
Root Mean Squared Error: 2.0523690545301125
R square: 0.9605444873189599

The Mean Squared Error (MSE) value obtained from the model evaluation is 4.21. In the context of this problem where SCORE values range from 0 to 100, an MSE of 4.21 can be considered relatively low, indicating that our model's predictions are reasonably close to the actual values.

Additonally, the R square is 0.96 and it is reasonable number in machine learning. But there may be a more optimized model, so let's look for it.

In [60]:
n_estimators_range = range(10, 211, 25)

mse_scores = []
r2_scores =[]

for n_estimators in n_estimators_range:
    # make model
    rf_model = RandomForestRegressor(n_estimators=n_estimators, random_state=42)
    
    # model fit
    rf_model.fit(X5_train, y5_train)
    
    # evaluation
    y_pred = rf_model.predict(X5_test)
    mse = mean_squared_error(y5_test, y_pred)
    r2 = r2_score(y5_test, y_pred)
    mse_scores.append(mse)
    r2_scores.append(r2)
    
    print('*The number of e_estimator: ', n_estimators)
    print('Mean Squared Error: ', mse)
    print('R square:', r2)
    print('')
*The number of e_estimator:  10
Mean Squared Error:  5.425235320484088
R square: 0.949182258946832

*The number of e_estimator:  35
Mean Squared Error:  4.384696530337818
R square: 0.958928901750284

*The number of e_estimator:  60
Mean Squared Error:  4.2562310125006215
R square: 0.9601322278797632

*The number of e_estimator:  85
Mean Squared Error:  4.228498027945325
R square: 0.960392000506112

*The number of e_estimator:  110
Mean Squared Error:  4.237120292201176
R square: 0.9603112364532438

*The number of e_estimator:  135
Mean Squared Error:  4.193067359239644
R square: 0.960723876718159

*The number of e_estimator:  160
Mean Squared Error:  4.197450239522635
R square: 0.9606828226325514

*The number of e_estimator:  185
Mean Squared Error:  4.2190710812087735
R square: 0.9604803019547831

*The number of e_estimator:  210
Mean Squared Error:  4.196148905700035
R square: 0.9606950121213597

In [61]:
plt.figure(figsize=(12, 6))

# MSE 
plt.subplot(1, 2, 1)
plt.plot(n_estimators_range, mse_scores, marker='o', linestyle='-', color='blue')
plt.title('Mean Squared Error (MSE) vs. n_estimators for Random Forest Model')
plt.xlabel('n_estimators')
plt.ylabel('Mean Squared Error (MSE)')
plt.grid(True)

# R-squared 
plt.subplot(1, 2, 2)
plt.plot(n_estimators_range, r2_scores, marker='o', linestyle='-', color='green')
plt.title('R-squared vs. n_estimators for Random Forest Model')
plt.xlabel('n_estimators')
plt.ylabel('R-squared')
plt.grid(True)

plt.tight_layout()
plt.show()

Let's take a closer look at that between 125 and 175.

In [62]:
n_estimators_range = range(125, 176, 10)

mse_scores = []
r2_scores =[]

for n_estimators in n_estimators_range:
    # make model
    rf_model = RandomForestRegressor(n_estimators=n_estimators, random_state=42)
    
    # model fit
    rf_model.fit(X5_train, y5_train)
    
    # evaluation
    y_pred = rf_model.predict(X5_test)
    mse = mean_squared_error(y5_test, y_pred)
    r2 = r2_score(y5_test, y_pred)
    mse_scores.append(mse)
    r2_scores.append(r2)
    
    print('*The number of e_estimator: ', n_estimators)
    print('Mean Squared Error: ', mse)
    print('R square:', r2)
    print('')
*The number of e_estimator:  125
Mean Squared Error:  4.20691022142537
R square: 0.960594211746147

*The number of e_estimator:  135
Mean Squared Error:  4.193067359239644
R square: 0.960723876718159

*The number of e_estimator:  145
Mean Squared Error:  4.188086305229894
R square: 0.9607705338487509

*The number of e_estimator:  155
Mean Squared Error:  4.194283651377267
R square: 0.9607124838079453

*The number of e_estimator:  165
Mean Squared Error:  4.199505869172142
R square: 0.9606635677156569

*The number of e_estimator:  175
Mean Squared Error:  4.203051034129474
R square: 0.9606303604418429

In this model, the model name is 42 and we will use the variable named 41

145 seems most appropriate as n_estimators. Let's make a model and evaluate it by applying this parameter.

In [63]:
plt.figure(figsize=(12, 6))

# MSE 
plt.subplot(1, 2, 1)
plt.plot(n_estimators_range, mse_scores, marker='o', linestyle='-', color='blue')
plt.title('Mean Squared Error (MSE) vs. n_estimators for Random Forest Model')
plt.xlabel('n_estimators')
plt.ylabel('Mean Squared Error (MSE)')
plt.grid(True)

# R-squared 
plt.subplot(1, 2, 2)
plt.plot(n_estimators_range, r2_scores, marker='o', linestyle='-', color='green')
plt.title('R-squared vs. n_estimators for Random Forest Model')
plt.xlabel('n_estimators')
plt.ylabel('R-squared')
plt.grid(True)

plt.tight_layout()
plt.show()
In [64]:
rf_model6 = RandomForestRegressor(n_estimators=145, random_state=42)
rf_model6.fit(X5_train, y5_train)

y6_pred = rf_model6.predict(X5_test)

mae6 = mean_absolute_error(y5_test, y6_pred)
mse6 = mean_squared_error(y5_test, y6_pred)
rmse6 = np.sqrt(mse6)
r2_6 = r2_score(y5_test, y6_pred)

print("Mean Abolute Error", mae6)
print("Mean Squared Error:", mse6)
print("Root Mean Squared Error:", rmse6)
print("R square:", r2_6)
Mean Abolute Error 1.5289262585202243
Mean Squared Error: 4.188086305229894
Root Mean Squared Error: 2.046481445122309
R square: 0.9607705338487509

In regression modeling, a random forest regression is more suitable than linear regression. With an accuracy of 96.07% and a very low Mean Squared Error of 4.18, it performs exceptionally well.

Demo Model¶

From here, this model is to make a simple demo model. We will make a simple model with only six important columns. It is used for demonstration.

Let's find the five important columns.

In [65]:
feature_importances = rf_model6.feature_importances_
feature_names = X5_train.columns

# Gtop 20 features
top_20_indices = feature_importances.argsort()[-20:][::-1]

# feature names and importances
top_20_features = [feature_names[i] for i in top_20_indices]
top_20_importances = [feature_importances[i] for i in top_20_indices]

top_20_features.reverse()
top_20_importances.reverse()

plt.figure(figsize=(10, 8))
plt.barh(top_20_features, top_20_importances)
plt.xlabel("Feature Importance")
plt.ylabel("Feature")
plt.title("Top 20 Random Forest Model Feature Importance")
plt.show()

The 6 above important columns are :

'ENTRANCE_LOBBY', 'EXTERIOR_GROUNDS', 'STAIRWELLS', 'INTERIOR_WALL_CEILING_FLOOR', 'INTERIOR_LIGHTING_LEVELS', 'WATER_PEN_EXT_BLDG_ELEMENTS'

In [66]:
# Assign x and y data
X7 = df_clean[['ENTRANCE_LOBBY', 'EXTERIOR_GROUNDS', 'STAIRWELLS', 'INTERIOR_WALL_CEILING_FLOOR', 'INTERIOR_LIGHTING_LEVELS', 'WATER_PEN_EXT_BLDG_ELEMENTS']]
y7 = df_clean['SCORE']

# Split test and train data.
X7_train, X7_test, y7_train, y7_test = train_test_split(X7, y7, test_size=0.3, random_state=42)
In [67]:
rf_model7 = RandomForestRegressor(n_estimators=145, random_state=42)
rf_model7.fit(X7_train, y7_train)

y7_pred = rf_model7.predict(X7_test)

mae7 = mean_absolute_error(y7_test, y7_pred)
mse7 = mean_squared_error(y7_test, y7_pred)
rmse7 = np.sqrt(mse7)
r2_7 = r2_score(y7_test, y7_pred)

print("Mean Abolute Error", mae7)
print("Mean Squared Error:", mse7)
print("Root Mean Squared Error:", rmse7)
print("R square:", r2_7)
Mean Abolute Error 2.6637812073054423
Mean Squared Error: 11.581661512335685
Root Mean Squared Error: 3.403184025634771
R square: 0.8912770053909287
In [68]:
n_estimators_range = range(10, 211, 25)

mse_scores = []
r2_scores =[]

for n_estimators in n_estimators_range:
    # make model
    rf_model = RandomForestRegressor(n_estimators=n_estimators, random_state=42)
    
    # model fit
    rf_model.fit(X7_train, y7_train)
    
    # evaluation
    y_pred = rf_model.predict(X7_test)
    mse = mean_squared_error(y7_test, y_pred)
    r2 = r2_score(y7_test, y_pred)
    mse_scores.append(mse)
    r2_scores.append(r2)
    
    print('*The number of e_estimator: ', n_estimators)
    print('Mean Squared Error: ', mse)
    print('R square:', r2)
    print('')
*The number of e_estimator:  10
Mean Squared Error:  11.919804656364244
R square: 0.8881026823297553

*The number of e_estimator:  35
Mean Squared Error:  11.697772502541822
R square: 0.8901870119950078

*The number of e_estimator:  60
Mean Squared Error:  11.635432944103501
R square: 0.8907722253919632

*The number of e_estimator:  85
Mean Squared Error:  11.602684845435464
R square: 0.8910796485843222

*The number of e_estimator:  110
Mean Squared Error:  11.603302711354798
R square: 0.8910738483601535

*The number of e_estimator:  135
Mean Squared Error:  11.573660275985372
R square: 0.8913521170988347

*The number of e_estimator:  160
Mean Squared Error:  11.57595419446467
R square: 0.8913305829099627

*The number of e_estimator:  185
Mean Squared Error:  11.573992568441765
R square: 0.891348997699178

*The number of e_estimator:  210
Mean Squared Error:  11.577904987297
R square: 0.8913122698174615

In [69]:
plt.figure(figsize=(12, 6))

# MSE 
plt.subplot(1, 2, 1)
plt.plot(n_estimators_range, mse_scores, marker='o', linestyle='-', color='blue')
plt.title('Mean Squared Error (MSE) vs. n_estimators for Random Forest Model')
plt.xlabel('n_estimators')
plt.ylabel('Mean Squared Error (MSE)')
plt.grid(True)

# R-squared 
plt.subplot(1, 2, 2)
plt.plot(n_estimators_range, r2_scores, marker='o', linestyle='-', color='green')
plt.title('R-squared vs. n_estimators for Random Forest Model')
plt.xlabel('n_estimators')
plt.ylabel('R-squared')
plt.grid(True)

plt.tight_layout()
plt.show()

If we use only six(6) columns to predict, 127 is the best n_estimators. Let's make our model for demo.

In [70]:
rf_model8 = RandomForestRegressor(n_estimators=128, random_state=42)
rf_model8.fit(X7_train, y7_train)

y8_pred = rf_model8.predict(X7_test)

mae8 = mean_absolute_error(y7_test, y8_pred)
mse8 = mean_squared_error(y7_test, y8_pred)
rmse8 = np.sqrt(mse8)
r2_8 = r2_score(y7_test, y8_pred)

print("Mean Abolute Error", mae8)
print("Mean Squared Error:", mse8)
print("Root Mean Squared Error:", rmse8)
print("R square:", r2_8)
Mean Abolute Error 2.663881573123019
Mean Squared Error: 11.587857661382285
Root Mean Squared Error: 3.404094249779563
R square: 0.8912188389630225

Save my model with Pickle.

In [71]:
import pickle

# Load the model from the file
with open('../models/rf_model8.pkl', 'wb') as file:
    pickle.dump(rf_model8, file)

Classification Models¶

To use the classification models, we should check the target feature. The target variable must be categorical data.

  1. Decision Tree
  2. K-Nearest Neighbor

'RESULTS_OF_SCORE' column is our target column. Let's explore it first.

In [72]:
agg_func_math = {
    'SCORE':
    ['count', 'mean', 'median', 'min', 'max', 'std', 'var']
}
df_clean.groupby(['RESULTS_OF_SCORE']).agg(agg_func_math).round(2)
Out[72]:
SCORE
count mean median min max std var
RESULTS_OF_SCORE
Building Audit 98 45.79 47.0 20 50 4.36 19.02
Evaluation needs to be conducted in 1 year 2476 60.43 61.0 51 65 3.67 13.50
Evaluation needs to be conducted in 2 years 7077 75.43 76.0 66 85 5.42 29.37
Evaluation needs to be conducted in 3 years 1501 90.21 89.0 86 100 3.51 12.30

we have 4 levels in this column.

The greatest level is 'Evaluation needs to be conducted in 3 years' and the mean of it is 90.21. The 86 to 100 scores belong here. 89 points is the median.

'Evaluation needs to be conducted in 2 years' accounnt for the largest number as 7077. The mean is 75.43 and median is 76. The variance of it is the largest at 29.37.

'Building Audit' is the price we should pay the most attention to. Because our goal is to make predictions so that we can avoid this score. Points 20 to 50 are included here. The mean is 45.79 and the median is 47.0. </span>

In the machine learning every variable should be presented as numbers. So we will make a new column.

In [73]:
# create new column for categorical resuslt to change the value to numbers.
df_clean['RESULTS_CODE'] = df_clean['RESULTS_OF_SCORE']

# input the values depends on the 'RESULTS_OF_SCORE'
df_clean.loc[df_clean['RESULTS_OF_SCORE'] == 'Building Audit', 'RESULTS_CODE'] = 0
df_clean.loc[df_clean['RESULTS_OF_SCORE'] == 'Evaluation needs to be conducted in 1 year', 'RESULTS_CODE'] = 1
df_clean.loc[df_clean['RESULTS_OF_SCORE'] == 'Evaluation needs to be conducted in 2 years', 'RESULTS_CODE'] = 2
df_clean.loc[df_clean['RESULTS_OF_SCORE'] == 'Evaluation needs to be conducted in 3 years', 'RESULTS_CODE'] = 3
In [74]:
df_clean['RESULTS_CODE'] = df_clean['RESULTS_CODE'].astype(int)
In [75]:
df_clean['RESULTS_CODE'].value_counts()
Out[75]:
RESULTS_CODE
2    7077
1    2476
3    1501
0      98
Name: count, dtype: int64

Decision Tree¶

The name will be made starting with a unit of 20.

In [76]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.tree import plot_tree

In this model, the variable name is 21

In [77]:
X21 = df_clean.drop(['SCORE', 'RESULTS_OF_SCORE', 'RESULTS_CODE'], axis=1)  # Features
y21 = df_clean['RESULTS_CODE']  # Target variable
In [78]:
X21_encoded = pd.get_dummies(X21, columns=['PROPERTY_TYPE', 'WARDNAME', 'GRID'])

X_train21, X_test21, y_train21, y_test21 = train_test_split(X21_encoded, y21, test_size=0.2, random_state=42)
In [79]:
dt_model21 = DecisionTreeClassifier(random_state=42, max_depth=3, min_samples_leaf=5)
dt_model21.fit(X_train21, y_train21)
Out[79]:
DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)

< RESULTS_OF_SCORE >

0 : Building Audit
1 : Evaluation needs to be conducted in 1 year
2 : Evaluation needs to be conducted in 2 years
3 : Evaluation needs to be conducted in 3 years

In [80]:
y_pred21 = dt_model21.predict(X_test21)

accuracy = accuracy_score(y_test21, y_pred21)
report = classification_report(y_test21, y_pred21)

print(f"Accuracy: {accuracy}")
print("Report:\n", report)
Accuracy: 0.8014343343792022
Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00        17
           1       0.77      0.73      0.75       489
           2       0.81      0.92      0.86      1435
           3       0.80      0.39      0.53       290

    accuracy                           0.80      2231
   macro avg       0.59      0.51      0.53      2231
weighted avg       0.79      0.80      0.79      2231

In [81]:
percentage_0 = len(df_clean[df_clean['RESULTS_OF_SCORE'] == 'Building Audit'])/df_clean['RESULTS_OF_SCORE'].count()
print('percentage of Building Audit(0):', percentage_0)
percentage of Building Audit(0): 0.008787661406025825

The values of the most important 'Building Audit' for us all came out too small at 0.00. The reason is that the ratio of values is too small at 0.88%. That is why we have to upsampling.

Upsampling¶

In [82]:
from sklearn.utils import resample

In this model, the variable name is 22

In [83]:
X22 = df_clean.drop(['SCORE', 'RESULTS_OF_SCORE', 'RESULTS_CODE'], axis=1)  # Features
y22 = df_clean['RESULTS_CODE']  # Target variable

X22_encoded = pd.get_dummies(X22, columns=['PROPERTY_TYPE', 'WARDNAME', 'GRID'])

X_train22, X_test22, y_train22, y_test22 = train_test_split(X22_encoded, y22, test_size=0.2, random_state=42)

In addition, existing targets have a severe data imbalance. Let's solve this together and try it.

In [84]:
from imblearn.over_sampling import SMOTE

# SMOTE
smote = SMOTE(random_state=42)

print('The numbers of Building Audit(0) before:', X_train22[y_train22 == 0].shape[0])
print('The numbers of Building Audit(1) before:', X_train22[y_train22 == 1].shape[0])
print('The numbers of Building Audit(2) before:', X_train22[y_train22 == 2].shape[0])
print('The numbers of Building Audit(3) before:', X_train22[y_train22 == 3].shape[0])

# resampling
X_train_resampled22, y_train_resampled22 = smote.fit_resample(X_train22, y_train22)


print()
print('The numbers of Building Audit(0) after:', X_train_resampled22[y_train_resampled22 == 0].shape[0])
print('The numbers of Building Audit(1) after:', X_train_resampled22[y_train_resampled22 == 1].shape[0])
print('The numbers of Building Audit(2) after:', X_train_resampled22[y_train_resampled22 == 2].shape[0])
print('The numbers of Building Audit(3) after:', X_train_resampled22[y_train_resampled22 == 3].shape[0])
The numbers of Building Audit(0) before: 81
The numbers of Building Audit(1) before: 1987
The numbers of Building Audit(2) before: 5642
The numbers of Building Audit(3) before: 1211

The numbers of Building Audit(0) after: 5642
The numbers of Building Audit(1) after: 5642
The numbers of Building Audit(2) after: 5642
The numbers of Building Audit(3) after: 5642
In [85]:
dt_model22 = DecisionTreeClassifier(random_state=42, max_depth=3, min_samples_leaf=5)
dt_model22.fit(X_train_resampled22, y_train_resampled22)
Out[85]:
DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
In [86]:
y_pred22 = dt_model22.predict(X_test22)

accuracy22 = accuracy_score(y_test22, y_pred22)
report22 = classification_report(y_test22, y_pred22)

print(f"Accuracy: {accuracy22}")
print("Report:\n", report22)
Accuracy: 0.7709547288211565
Report:
               precision    recall  f1-score   support

           0       0.34      0.71      0.46        17
           1       0.68      0.76      0.72       489
           2       0.86      0.79      0.82      1435
           3       0.60      0.71      0.65       290

    accuracy                           0.77      2231
   macro avg       0.62      0.74      0.66      2231
weighted avg       0.79      0.77      0.78      2231

In [87]:
plt.figure(figsize=(10, 6))
plot_tree(dt_model22, 
          feature_names=dt_model22.feature_names_in_, 
          rounded=True,
          impurity=False,
          filled=True,
          fontsize=11);

In the above cell, we set max_depth as 3. However the accuracy will be changed when we set it differently so we should find what is best number to make 'Decision Tree'.

In [88]:
train_acc = []
test_acc = []

# loop for finding best max_depth
for max_depth in range(1,12):
    
    # initialize model
    dt_model = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    
    # fit model
    dt_model.fit(X_train_resampled22, y_train_resampled22)
    
    # score model
    print('* The number of max_depth :', max_depth)
    print('* The number of actual depth:', dt_model.get_depth())
    print('Test accuracy:', dt_model.score(X_test22, y_test22))
    print('Train accuracy', dt_model.score(X_train_resampled22, y_train_resampled22))
    print('')
    
    test_acc.append(dt_model.score(X_test22, y_test22))
    train_acc.append(dt_model.score(X_train_resampled22, y_train_resampled22))
* The number of max_depth : 1
* The number of actual depth: 1
Test accuracy: 0.13536530703720304
Train accuracy 0.4869727047146402

* The number of max_depth : 2
* The number of actual depth: 2
Test accuracy: 0.3290004482294935
Train accuracy 0.6985554767812833

* The number of max_depth : 3
* The number of actual depth: 3
Test accuracy: 0.7709547288211565
Train accuracy 0.7706043956043956

* The number of max_depth : 4
* The number of actual depth: 4
Test accuracy: 0.7543702375616316
Train accuracy 0.8222704714640199

* The number of max_depth : 5
* The number of actual depth: 5
Test accuracy: 0.770058269834155
Train accuracy 0.8528890464374336

* The number of max_depth : 6
* The number of actual depth: 6
Test accuracy: 0.8095024652622143
Train accuracy 0.8663151364764268

* The number of max_depth : 7
* The number of actual depth: 7
Test accuracy: 0.8242940385477364
Train accuracy 0.8794753633463311

* The number of max_depth : 8
* The number of actual depth: 8
Test accuracy: 0.8072613177947109
Train accuracy 0.8974211272598369

* The number of max_depth : 9
* The number of actual depth: 9
Test accuracy: 0.8103989242492156
Train accuracy 0.912974122651542

* The number of max_depth : 10
* The number of actual depth: 10
Test accuracy: 0.8144329896907216
Train accuracy 0.9269319390287132

* The number of max_depth : 11
* The number of actual depth: 11
Test accuracy: 0.822052891080233
Train accuracy 0.9382754342431762

We will draw a plot to make it easy to recognize at a glance.

In [89]:
plt.figure(figsize=(10, 5))
plt.plot(range(1, 12), train_acc, marker="o", label="train accuracy")
plt.plot(range(1, 12), test_acc,  marker="o", label="test accuracy")
plt.xlabel("Max Depth")
plt.ylabel("Accuracy")
plt.legend()
plt.show()

let's try max_depth as seven(6).

In [90]:
dt_model23 = DecisionTreeClassifier(random_state=42, max_depth=6, min_samples_leaf=5)
dt_model23.fit(X_train_resampled22, y_train_resampled22)

y_pred23 = dt_model23.predict(X_test22)

accuracy23 = accuracy_score(y_test22, y_pred23)
report23 = classification_report(y_test22, y_pred23)

print(f"Accuracy: {accuracy23}")
print("Report:\n", report23)
Accuracy: 0.8095024652622143
Report:
               precision    recall  f1-score   support

           0       0.86      0.71      0.77        17
           1       0.76      0.84      0.80       489
           2       0.92      0.78      0.84      1435
           3       0.58      0.92      0.71       290

    accuracy                           0.81      2231
   macro avg       0.78      0.81      0.78      2231
weighted avg       0.84      0.81      0.81      2231

let's try max_depth as seven(7).

In [91]:
dt_model24 = DecisionTreeClassifier(random_state=42, max_depth=7, min_samples_leaf=5)
dt_model24.fit(X_train_resampled22, y_train_resampled22)

y_pred24 = dt_model24.predict(X_test22)

accuracy24 = accuracy_score(y_test22, y_pred24)
report24 = classification_report(y_test22, y_pred24)

print(f"Accuracy: {accuracy24}")
print("Report:\n", report24)
Accuracy: 0.8256387270282385
Report:
               precision    recall  f1-score   support

           0       0.92      0.71      0.80        17
           1       0.84      0.82      0.83       489
           2       0.91      0.81      0.86      1435
           3       0.57      0.91      0.70       290

    accuracy                           0.83      2231
   macro avg       0.81      0.81      0.80      2231
weighted avg       0.85      0.83      0.83      2231

In [92]:
plt.figure(figsize=(10, 6))
plot_tree(dt_model23, 
          feature_names=dt_model23.feature_names_in_, 
          rounded=True,
          impurity=False,
          filled=True);

Precision: Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. It measures the accuracy of positive predictions.

Recall: Recall is the ratio of correctly predicted positive observations to all actual positive observations. It measures the model's ability to identify all relevant instances.

F1-Score: The F1-Score is the harmonic mean of precision and recall. It provides a balance between precision and recall and is useful when there is an uneven class distribution.

When Max-depth was specified as 7, it was well predicted that the precision of Building Audit(0) was 0.92, recall was 0.71 and f1-score was 0.80.

Demo Model¶

From here, this model is to make a simple demo model. We will make a simple model with only eight important columns. It is used for demonstration.

In [93]:
importance = dt_model24.feature_importances_
feature_names = X_train_resampled22.columns

# importnace
feature_importance = sorted(zip(importance, feature_names), reverse=True)

# top 20
top_20_features = feature_importance[:20]

sorted_features = [feature for _, feature in top_20_features]
sorted_importance = [importance for importance, _ in top_20_features]

# plot
plt.figure(figsize=(10, 6))
plt.barh(sorted_features, sorted_importance)
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Top 20 Feature Importance in Decision Tree Model')
plt.gca().invert_yaxis()  # 변수 이름을 내림차순으로 표시
plt.show()
In [94]:
X23 = df_clean[['ENTRANCE_DOORS_WINDOWS',
                 'ENTRANCE_LOBBY',
                 'INTERIOR_WALL_CEILING_FLOOR',
                 'WATER_PEN_EXT_BLDG_ELEMENTS',
                 'EXTERIOR_GROUNDS',
                 'INTERIOR_LIGHTING_LEVELS',
                 'STAIRWELLS',
                 'SECURITY']] # Features
y23 = df_clean['RESULTS_CODE']  # Target variable

X_train23, X_test23, y_train23, y_test23 = train_test_split(X23, y23, test_size=0.3, random_state=42)
In [95]:
# SMOTE
smote = SMOTE(random_state=42)

print('The numbers of Building Audit(0) before:', X_train23[y_train23 == 0].shape[0])
print('The numbers of Building Audit(1) before:', X_train23[y_train23 == 1].shape[0])
print('The numbers of Building Audit(2) before:', X_train23[y_train23 == 2].shape[0])
print('The numbers of Building Audit(3) before:', X_train23[y_train23 == 3].shape[0])

# resampling
X_train_resampled23, y_train_resampled23 = smote.fit_resample(X_train23, y_train23)


print()
print('The numbers of Building Audit(0) after:', X_train_resampled23[y_train_resampled23 == 0].shape[0])
print('The numbers of Building Audit(1) after:', X_train_resampled23[y_train_resampled23 == 1].shape[0])
print('The numbers of Building Audit(2) after:', X_train_resampled23[y_train_resampled23 == 2].shape[0])
print('The numbers of Building Audit(3) after:', X_train_resampled23[y_train_resampled23 == 3].shape[0])
The numbers of Building Audit(0) before: 72
The numbers of Building Audit(1) before: 1745
The numbers of Building Audit(2) before: 4924
The numbers of Building Audit(3) before: 1065

The numbers of Building Audit(0) after: 4924
The numbers of Building Audit(1) after: 4924
The numbers of Building Audit(2) after: 4924
The numbers of Building Audit(3) after: 4924
In [96]:
dt_model23 = DecisionTreeClassifier(random_state=42, max_depth=3, min_samples_leaf=5)
dt_model23.fit(X_train_resampled23, y_train_resampled23)
Out[96]:
DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
In [97]:
y_pred23 = dt_model23.predict(X_test23)

accuracy23 = accuracy_score(y_test23, y_pred23)
report23 = classification_report(y_test23, y_pred23)

print(f"Accuracy: {accuracy23}")
print("Report:\n", report23)
Accuracy: 0.7429766885833832
Report:
               precision    recall  f1-score   support

           0       0.35      0.88      0.51        26
           1       0.60      0.83      0.70       731
           2       0.88      0.72      0.79      2153
           3       0.60      0.70      0.65       436

    accuracy                           0.74      3346
   macro avg       0.61      0.78      0.66      3346
weighted avg       0.78      0.74      0.75      3346

In [98]:
train_acc = []
test_acc = []

# loop for finding best max_depth
for max_depth in range(1,12):
    
    # initialize model
    dt_model = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    
    # fit model
    dt_model.fit(X_train_resampled23, y_train_resampled23)
    
    # score model
    print('* The number of max_depth :', max_depth)
    print('* The number of actual depth:', dt_model.get_depth())
    print('Test accuracy:', dt_model.score(X_test23, y_test23))
    print('Train accuracy', dt_model.score(X_train_resampled23, y_train_resampled23))
    print('')
    
    test_acc.append(dt_model.score(X_test23, y_test23))
    train_acc.append(dt_model.score(X_train_resampled23, y_train_resampled23))
* The number of max_depth : 1
* The number of actual depth: 1
Test accuracy: 0.13777644949193066
Train accuracy 0.497410641754671

* The number of max_depth : 2
* The number of actual depth: 2
Test accuracy: 0.3302450687387926
Train accuracy 0.6746547522339561

* The number of max_depth : 3
* The number of actual depth: 3
Test accuracy: 0.7429766885833832
Train accuracy 0.7944252640129975

* The number of max_depth : 4
* The number of actual depth: 4
Test accuracy: 0.7193664076509265
Train accuracy 0.8420999187652315

* The number of max_depth : 5
* The number of actual depth: 5
Test accuracy: 0.7393903167961745
Train accuracy 0.8623578391551584

* The number of max_depth : 6
* The number of actual depth: 6
Test accuracy: 0.7576210400478183
Train accuracy 0.8803310316815597

* The number of max_depth : 7
* The number of actual depth: 7
Test accuracy: 0.7904961147638971
Train accuracy 0.8956133225020309

* The number of max_depth : 8
* The number of actual depth: 8
Test accuracy: 0.7979677226539151
Train accuracy 0.9073415922014623

* The number of max_depth : 9
* The number of actual depth: 9
Test accuracy: 0.8156007172743575
Train accuracy 0.9214053614947197

* The number of max_depth : 10
* The number of actual depth: 10
Test accuracy: 0.8257621040047818
Train accuracy 0.930798131600325

* The number of max_depth : 11
* The number of actual depth: 11
Test accuracy: 0.8347280334728033
Train accuracy 0.9381600324939074

In [99]:
plt.figure(figsize=(10, 5))
plt.plot(range(1, 12), train_acc, marker="o", label="train accuracy")
plt.plot(range(1, 12), test_acc,  marker="o", label="test accuracy")
plt.xlabel("Max Depth")
plt.ylabel("Accuracy")
plt.legend()
plt.show()

Depth 3 is also most appropriate in the model for demonstration.

Let's save with pickle

In [100]:
import pickle

# Load the model from the file
with open('../models/dt_model23.pkl', 'wb') as file:
    pickle.dump(dt_model23, file)
In [ ]:
 

K-Nearest Neighbor¶

The name will be made starting with a unit of 30.

In [101]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In this model, the model name is 31 and the variable name is 31

In [102]:
X31 = df_clean.drop(['SCORE', 'RESULTS_OF_SCORE', 'RESULTS_CODE'], axis=1)
y31 = df_clean['RESULTS_CODE']

# we have object columns like 'PROPERTY_TYPE', 'WARDNAME' , etc.
X31_encoded = pd.get_dummies(X31, columns=['PROPERTY_TYPE', 'WARDNAME', 'GRID'])

KNN can be sensitive to a feature's unit scale. This required us to normalize the data before using KNN classifiers.

In [103]:
y31
Out[103]:
1        2
2        2
3        2
4        2
5        2
        ..
11755    2
11756    2
11757    2
11758    2
11759    2
Name: RESULTS_CODE, Length: 11152, dtype: int64
In [104]:
from sklearn.preprocessing import StandardScaler

X31 = StandardScaler().fit_transform(X31_encoded)
In [105]:
# Split the data
X_train31, X_test31, y_train31, y_test31 = train_test_split(X31_encoded, y31, test_size=0.3, random_state=1)
In [106]:
from imblearn.over_sampling import SMOTE

# SMOTE
smote = SMOTE(random_state=42)

print('The numbers of Building Audit(0) before:', X_train31[y_train31 == 0].shape[0])
print('The numbers of Building Audit(1) before:', X_train31[y_train31 == 1].shape[0])
print('The numbers of Building Audit(2) before:', X_train31[y_train31 == 2].shape[0])
print('The numbers of Building Audit(3) before:', X_train31[y_train31 == 3].shape[0])

# resampling
X_train_resampled31, y_train_resampled31 = smote.fit_resample(X_train31, y_train31)


print()
print('The numbers of Building Audit(0) after:', X_train_resampled31[y_train_resampled31 == 0].shape[0])
print('The numbers of Building Audit(1) after:', X_train_resampled31[y_train_resampled31 == 1].shape[0])
print('The numbers of Building Audit(2) after:', X_train_resampled31[y_train_resampled31 == 2].shape[0])
print('The numbers of Building Audit(3) after:', X_train_resampled31[y_train_resampled31 == 3].shape[0])
The numbers of Building Audit(0) before: 71
The numbers of Building Audit(1) before: 1746
The numbers of Building Audit(2) before: 4949
The numbers of Building Audit(3) before: 1040

The numbers of Building Audit(0) after: 4949
The numbers of Building Audit(1) after: 4949
The numbers of Building Audit(2) after: 4949
The numbers of Building Audit(3) after: 4949
In [107]:
# Instantiate the model & fit it to our data
KNN_model31 = KNeighborsClassifier(n_neighbors=3)
KNN_model31.fit(X_train31, y_train31)
Out[107]:
KNeighborsClassifier(n_neighbors=3)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
KNeighborsClassifier(n_neighbors=3)
In [108]:
# Score the model on the test set
train_predictions31 = KNN_model31.predict(X_train31)
test_predictions31 = KNN_model31.predict(X_test31)

train_accuracy31 = accuracy_score(train_predictions31, y_train31)
test_accuracy31 = accuracy_score(test_predictions31, y_test31)

print(f"Train set accuracy: {train_accuracy31}")
print(f"Test set accuracy: {test_accuracy31}")
Train set accuracy: 0.7950294645144761
Test set accuracy: 0.6413628212791392

When we put three(3) for n_neighbors parameter, It looks over-fitting because the train set accuracy is higher than the test set accuracy.

So, let's plot the relation between the accuracy and the number of neignbors to try the other numbers.

In [109]:
n_neighbor = range(1, 60, 10)

train_accuracy = []
test_accuracy = []

for n in n_neighbor:
    
    # Instantiate the model & fit it to our data
    KNN_model = KNeighborsClassifier(n_neighbors=n)
    KNN_model.fit(X_train31, y_train31)
    
    # Score the model and apppend in to the lists
    train_accuracy.append(KNN_model.score(X_train31, y_train31))
    test_accuracy.append(KNN_model.score(X_test31, y_test31))
In [110]:
plt.figure(figsize=(10,7))

plt.plot(n_neighbor, train_accuracy, color='b', label = 'train')
plt.plot(n_neighbor, test_accuracy, color='r', label = 'test')

plt.xlabel('K for KNN')
plt.ylabel('accuracy')
plt.legend()

plt.title('Accuracy of KNN vs. number of neibors')

plt.show()

The number 43 seems appropriate. Let's try 43.

In this model, the model name is 32 and we will use the variable named 31

In [111]:
# Instantiate the model & fit it to our data with parameter 43
KNN_model32 = KNeighborsClassifier(n_neighbors=43)
KNN_model32.fit(X_train31, y_train31)

# Score the model on the test set
train_predictions32 = KNN_model32.predict(X_train31)
test_predictions32 = KNN_model32.predict(X_test31)

train_accuracy32 = accuracy_score(train_predictions32, y_train31)
test_accuracy32 = accuracy_score(test_predictions32, y_test31)

print(f"Train set accuracy: {train_accuracy32}")
print(f"Test set accuracy: {test_accuracy32}")
Train set accuracy: 0.683192416090187
Test set accuracy: 0.682904961147639

As we review the number 43 is best for our model. According the result, our model predict with a probability of 68%.

In [112]:
from sklearn.model_selection import cross_val_score

# Instantiate the KNN model with the desired number of neighbors
KNN_model33 = KNeighborsClassifier(n_neighbors=43)

# Perform 5-Fold Cross Validation
cross_val_scores33 = cross_val_score(KNN_model33, X_train31, y_train31, cv=5)

# Print the cross-validation scores
print("Cross-Validation Scores:", cross_val_scores33)

# Calculate and print the mean and standard deviation of the cross-validation scores
mean_cv_score33 = cross_val_scores33.mean()
std_cv_score33 = cross_val_scores33.std()
print(f"Mean Cross-Validation Score: {mean_cv_score33}")
print(f"Standard Deviation of Cross-Validation Scores: {std_cv_score33}")
Cross-Validation Scores: [0.66965429 0.66816143 0.67905189 0.66752082 0.66880205]
Mean Cross-Validation Score: 0.6706380968239112
Standard Deviation of Cross-Validation Scores: 0.0042657268177913295

We attempted cross-validation to improve accuracy, but it did not lead to improvement. This reason is assumed to be due to the fact that the model is not appropriate for our data or is too simple.

Conclusion¶

In the data, one sample scored 3 points in all evaluations, with an evaluation score of 100. This is not just a calculation of the evaluation index, but is predicted to have been evaluated in consideration of the registration year, building year, and location. However, in the regression model, machine learning seems to have created its own formula. Because the accuracy is so high that it's hard to see it as a prediction.

Therefore, in this case, it can be said that the classification model is suitable for prediction. Among them, the decision tree is the most accurate model, so it is suitable for us.